The aim of this script is exploratory. I understand that more analysis must be conducted before concluding that size matters in batting. I chose three outcome variables: runs batted in (rbi), home runs (hr), and salary.

I start by loading the necessary packages.

library(ggplot2)
library(dplyr)
library(plotly)
library(DT)

Then I load the regular season data for batters along with the salary and player information.

rsbatters <- read.csv("batting.csv", stringsAsFactors = FALSE)
playerinfo <- read.csv("player.csv", stringsAsFactors = FALSE)
salary <- read.csv("salary.csv", stringsAsFactors = FALSE)

Note: while the rsbatters and playerinfo data frames cover 1925-2015, the salary data are only available for 1985-2015.

I attach the player information to both data sets with a left join on player_id.

Databatters <- left_join(rsbatters,playerinfo, by="player_id")
Datasalary <- left_join(salary,playerinfo, by="player_id")
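
A quick way to check a merge like this is to confirm that the row count is unchanged and to list the keys that found no match. The sketch below uses small made-up tables (batting_toy and player_toy are hypothetical stand-ins, not the real files) and assumes player_id is unique in the player table.

```r
library(dplyr)

# Toy stand-ins for the batting and player tables (hypothetical values)
batting_toy <- data.frame(
  player_id = c("a01", "a01", "b02", "c03"),
  year      = c(2000, 2001, 2000, 2000),
  hr        = c(10, 12, 3, 7),
  stringsAsFactors = FALSE
)
player_toy <- data.frame(
  player_id = c("a01", "b02"),
  height    = c(72, 70),
  weight    = c(190, 175),
  stringsAsFactors = FALSE
)

# left_join keeps every batting row; unmatched players get NA height/weight
joined <- left_join(batting_toy, player_toy, by = "player_id")

# anti_join lists the batting rows that found no match in the player table
unmatched <- anti_join(batting_toy, player_toy, by = "player_id")

nrow(joined)         # 4, the same number of rows as batting_toy
unmatched$player_id  # "c03"
```

If the player table ever contained duplicate player_id values, the join would add rows, so comparing nrow() before and after is a cheap safeguard.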

Height and Weight from 1925-2015.

Here I limit the sample to 1925 onward, since that is when data is available for both variables. I then group by year and find the average height and weight of the players for every year, skipping any missing values. Note: in the data set, a single player’s height and weight do not vary throughout his career.

Databatters %>%
  filter(year >= 1925) %>%
  group_by(year) %>%
  summarize(aweight = mean(weight, na.rm = TRUE), aheight = mean(height, na.rm = TRUE)) -> Databatters1
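
To illustrate why missing values matter in a yearly average, here is a minimal sketch on made-up rows (the toy data frame and the n_weight column are my own additions for illustration). Without na.rm = TRUE, a single NA would turn that year's mean into NA.

```r
library(dplyr)

# Hypothetical player-season rows, including one season before the cutoff
toy <- data.frame(
  year   = c(1924, 1925, 1925, 1926, 1926),
  weight = c(170, 180, NA, 185, 175),
  height = c(70, 71, 72, NA, 73)
)

yearly <- toy %>%
  filter(year >= 1925) %>%          # keep 1925 onward
  group_by(year) %>%
  summarize(
    aweight  = mean(weight, na.rm = TRUE),
    aheight  = mean(height, na.rm = TRUE),
    n_weight = sum(!is.na(weight))  # how many values each mean rests on
  )

yearly
```

Keeping a count next to each mean makes it easier to spot years where the average rests on very few observations.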

Next I use two plots to see whether players have been increasing in size. I use ggplot to make the graphs and plotly for interactivity.

# Weight by Years
a <- ggplot(data = Databatters1, aes(year, aweight)) + geom_point(color="blue3") + labs(title="Average Weight by Year 1925-2015",x="YEAR", y="AVERAGE WEIGHT")
ggplotly(a)
# Height by Years
b <- ggplot(data = Databatters1, aes(year, aheight)) + geom_point(color="blue3") + labs(title="Average Height by Year 1925-2015",x="YEAR", y="AVERAGE HEIGHT")
ggplotly(b)

The second plot shows that the height of MLB players has been steadily increasing over the years. Weight also increases during the period, but somewhere around 1984 there is an inflection point, after which players’ weight seems to grow at a faster rate. I bet we all might have a good guess as to why… Notice also the sharp decline in weight sometime after 2011.

Size and Performance (RBIs and Home Runs)

Next I wanted to see graphically whether weight and height matter for performance. I restrict the graphs to seasons in which a player appeared in at least 100 games, drop rows with a recorded height of 50 or less, and remove any missing data. Given that height and weight do not vary in the “player.csv” file, I group by player (ID, first name and last name) and calculate average career statistics for each player.

Databatters %>%
  select(player_id,name_first,name_last,rbi,hr,height,weight,year,g) %>%
  filter(height>50,year>=1925,g>=100) %>%
  group_by(player_id,name_first,name_last) %>%
  summarize(arbi=mean(rbi), ahr=mean(hr), aheight=mean(height), aweight=mean(weight)) -> Databatters2
  
Databatters2 <- na.omit(Databatters2)

The code here is long but I am mainly changing the appearance of the ggplot and adding lines that identify the average height and weight of the players.

# Size and RBI

ggplot(data = Databatters2, aes(aheight, aweight)) +
  geom_point(aes(color = arbi)) +
  geom_hline(yintercept = mean(Databatters2$aweight, na.rm = TRUE)) +
  geom_vline(xintercept = mean(Databatters2$aheight, na.rm = TRUE)) +
  theme(panel.background = element_rect(fill = "grey")) +
  labs(title = "REGULAR SEASON RBI 1925-2015", x = "HEIGHT", y = "WEIGHT") +
  scale_colour_gradientn(colours = topo.colors(10))

# Size and HR
ggplot(data = Databatters2, aes(aheight, aweight)) +
  geom_point(aes(color = ahr)) +
  geom_hline(yintercept = mean(Databatters2$aweight, na.rm = TRUE)) +
  geom_vline(xintercept = mean(Databatters2$aheight, na.rm = TRUE)) +
  theme(panel.background = element_rect(fill = "grey")) +
  labs(title = "REGULAR SEASON HR 1925-2015", x = "HEIGHT", y = "WEIGHT") +
  scale_colour_gradientn(colours = topo.colors(10))

The lower left rectangle identifies “small” players while the upper right one identifies “big” players (as determined by the mean). Although size does not seem to matter much for RBIs, it does seem to have an impact on home runs (weight more so than height). Here is a table to search for players highlighted in the scatterplots.

datatable(Databatters2, class='compact')
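
The quadrant reading of the scatterplots can also be put into numbers. The sketch below is only an illustration on made-up career averages (the toy data frame and the quadrant labels are hypothetical); it splits players at the mean height and weight, as the reference lines in the plots do, and averages home runs within each quadrant.

```r
# Hypothetical career averages in the same shape as Databatters2
toy <- data.frame(
  aheight = c(70, 71, 74, 75, 69, 76),
  aweight = c(170, 200, 180, 220, 165, 230),
  ahr     = c(5, 14, 8, 22, 3, 25)
)

# Split at the sample means, mirroring the horizontal and vertical lines
tall  <- toy$aheight > mean(toy$aheight)
heavy <- toy$aweight > mean(toy$aweight)

# Label the four rectangles of the scatterplot
toy$quadrant <- ifelse(tall & heavy, "big",
                ifelse(!tall & !heavy, "small",
                ifelse(tall, "tall-light", "short-heavy")))

# Average home runs per quadrant
quad_means <- aggregate(ahr ~ quadrant, data = toy, FUN = mean)
quad_means
```

Applied to Databatters2, the same steps would give one average per rectangle, which is easier to compare than colours on a dense scatterplot.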

Size and Salary (1985-2015)

Finally, I want to see whether size affects the salary a player will earn.

Datasalary %>%
  select(player_id,name_first,name_last,height,weight,year,salary) %>%
  filter(height>50,year>=1925) %>%
  group_by(player_id,name_first,name_last,year) %>%
  summarize(asalary=mean(salary), aheight=mean(height), aweight=mean(weight)) ->Databatters3

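I use the consumer price index to adjust salary for inflation. Each season’s salary is divided by the ratio of that year’s CPI to the previous year’s CPI, so the adjustment removes one year of price growth rather than converting every season to a common base year. I also flag players at or below the mean height (short) and at or below the mean weight (light), and plot the average adjusted salary of each group by year.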

#CPI data. Source:https://research.stlouisfed.org/fred2/series/CPIAUCSL/downloaddata
library(DataCombine)

year<-seq(from=1984, to=2015, by=1)
cpi<-c(102.100,105.700,109.900,111.400,116.000,121.200,127.500,134.700,138.300
       ,142.800,146.300,150.500,154.700,159.400,162.000,164.700,169.300,175.600
       ,177.700,182.600,186.300,191.600,199.300,203.437,212.174,211.933,217.488
       ,221.187,227.860,231.641,235.436,234.954)
cpidat<-data.frame(year,cpi)
cpidat<- slide(cpidat, Var = "cpi", slideBy = -1)
cpidat$reindex<-cpidat$cpi/cpidat$`cpi-1`
cpidat<-na.omit(cpidat)
Databatters3<-merge(Databatters3,cpidat,by="year")
Databatters3$asalaryadj<-Databatters3$asalary/Databatters3$reindex
Databatters3$short<-as.numeric(Databatters3$aheight<=mean(Databatters3$aheight))
Databatters3$light<-as.numeric(Databatters3$aweight<=mean(Databatters3$aweight))

Databatters3 %>%
  group_by(year,short)%>%
  summarize(asalaryadj=mean(asalaryadj))->Databatters4
e <- ggplot(data = Databatters4, aes(year, asalaryadj)) + geom_line(aes(color = factor(short))) + theme(panel.background = element_rect(fill = "grey")) + labs(title="Average Real Salary 1985-2015 (short vs. tall)", x="YEAR", y="AVERAGE REAL SALARY", color="short")
ggplotly(e)
Databatters3 %>%
  group_by(year,light)%>%
  summarize(asalaryadj=mean(asalaryadj))->Databatters5
f <- ggplot(data = Databatters5, aes(year, asalaryadj)) + geom_line(aes(color = factor(light))) + theme(panel.background = element_rect(fill = "grey")) + labs(title="Average Real Salary 1985-2015 (light vs. heavy)", x="YEAR", y="AVERAGE REAL SALARY", color="light")
ggplotly(f)
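
For comparison, a common alternative to a year-over-year ratio is to express every salary in the dollars of one base year. The sketch below is not what the script above does; it assumes 2015 as the base year, reuses three of the CPI values listed above, and applies them to made-up salaries (sal_toy and asalary_2015usd are hypothetical names).

```r
# CPI values copied from the script above for three of the years
cpi_toy <- data.frame(
  year = c(1985, 2000, 2015),
  cpi  = c(105.7, 169.3, 234.954)
)

# Deflator relative to the assumed base year (2015); equals 1 in 2015
base_cpi <- cpi_toy$cpi[cpi_toy$year == 2015]
cpi_toy$deflator <- cpi_toy$cpi / base_cpi

# Hypothetical nominal average salaries
sal_toy <- data.frame(
  year    = c(1985, 2000, 2015),
  asalary = c(500000, 2000000, 4000000)
)

# Attach the deflator by year and convert to 2015 dollars
sal_toy <- merge(sal_toy, cpi_toy, by = "year")
sal_toy$asalary_2015usd <- sal_toy$asalary / sal_toy$deflator
sal_toy
```

Because every player in a given year shares the same deflator, the choice of adjustment does not change the short vs. tall or light vs. heavy comparison within a year; it only changes how salaries compare across years.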

It is clear that the average real salaries of batters have gone up during the period. “Short” players have had a lower salary for most of the period (except for a brief stretch between 1995 and 1999). “Light” players, on the other hand, had higher salaries during 1993-2003. Only recently (2004-2015) have heavy players enjoyed a noticeably higher real salary.

# Size and Salary
ggplot(data = Databatters3, aes(aheight, aweight)) +
  geom_point(aes(color = asalaryadj)) +
  geom_hline(yintercept = mean(Databatters3$aweight, na.rm = TRUE)) +
  geom_vline(xintercept = mean(Databatters3$aheight, na.rm = TRUE)) +
  theme(panel.background = element_rect(fill = "grey")) +
  labs(title = "REGULAR SEASON AVERAGE SALARY 1985-2015", x = "HEIGHT", y = "WEIGHT") +
  scale_colour_gradientn(colours = topo.colors(10))

This last scatter plot suggests that “lighter players” (weighing less than 185 pounds) might get paid less than heavier ones. Finally, here is the table to search for specific players.

datatable(Databatters3, class='compact')

OLS Coefficients

Below is some code for plotting the OLS coefficients.

library(arm)
# varnames must follow the coefficient order of the formula: intercept, aheight, aweight
fit1 <- lm(ahr~aheight+aweight,data=Databatters2)
coefplot(fit1, col.pts="blue",main="OLS Coefficients of avg. hr = avg. height + avg. weight",cex.pts=1.5,varnames=c("Intercept","Avg. Height","Avg. Weight"))

fit2 <- lm(arbi~aheight+aweight,data=Databatters2)
coefplot(fit2, col.pts="blue",main="OLS Coefficients of avg. rbi = avg. height + avg. weight",cex.pts=1.5,varnames=c("Intercept","Avg. Height","Avg. Weight"))

fit3 <- lm(asalaryadj~aheight+aweight, data=Databatters3)
coefplot(fit3, col.pts="blue",main="OLS Coefficients of avg. salaryadj = avg. height + avg. weight",cex.pts=1.5,varnames=c("Intercept","Avg. Height","Avg. Weight"))

Notice the large variance for the height coefficient. This is expected, as there is little variation in the height of players (recall that height does not increase much over time). The first two regressions suggest a positive association of height and weight with average RBIs and home runs.
The salary regression suggests that salary rises with weight and falls with height.

*Note: I understand that OLS assumptions (linearity, homoscedasticity, no omitted variables, etc.) might be violated and that the coefficient estimates might be biased. This is still a work in progress…
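
One way to look at the coefficient uncertainty discussed above is to print confidence intervals directly and to refit the model with standardized predictors, so that height and weight are measured in comparable units. The sketch below uses synthetic data generated only for illustration (the numbers and the toy data frame are made up); the column names mirror Databatters2.

```r
set.seed(1)
n <- 200

# Synthetic data: weight is correlated with height, hr depends on weight
height <- rnorm(n, mean = 73, sd = 2)
weight <- 190 + 5 * (height - 73) + rnorm(n, sd = 15)
hr     <- 5 + 0.1 * (weight - 190) + rnorm(n, sd = 4)
toy <- data.frame(ahr = hr, aheight = height, aweight = weight)

# Raw-unit fit: coefficients are per inch and per pound
fit_raw <- lm(ahr ~ aheight + aweight, data = toy)
ci_raw  <- confint(fit_raw)   # 95% confidence intervals

# Standardized fit: coefficients are per standard deviation of each predictor
fit_std <- lm(ahr ~ scale(aheight) + scale(aweight), data = toy)
ci_std  <- confint(fit_std)

names(coef(fit_raw))  # "(Intercept)" "aheight" "aweight"
ci_raw
ci_std
```

The coefficient order printed by names(coef(fit_raw)) follows the formula, which is the order any labels passed to a coefficient plot need to match. For a first look at linearity and homoscedasticity, plot(fit_raw, which = 1) draws residuals against fitted values.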

Preliminary Conclusions

  1. Heavier players record more RBIs and home runs than lighter players.
  2. Taller players record more RBIs and home runs than shorter players.
  3. Weight increases salary while height decreases salary.